Nano‐Strainer: A workflow for the identification of single‐copy nuclear loci for plant systematic studies, using target capture kits and Oxford Nanopore long reads

Abstract In modern plant systematics, target enrichment enables simultaneous analysis of hundreds of genes. However, when dealing with reticulate or polyploidization histories, few markers may suffice, but often are required to be single‐copy, a condition that is not necessarily met with commercial capture kits. Also, large genome sizes can render target capture ineffective, so that amplicon sequencing would be preferable; however, knowledge about suitable loci is often missing. Here, we present a comprehensive workflow for the identification of putative single‐copy nuclear markers in a genus of interest, by mining a small dataset from target capture using a few representative taxa. The proposed pipeline assesses sequence variability contained in the data from targeted loci and assigns reads to their respective genes, via a combined BLAST/clustering procedure. Cluster consensus sequences are then examined based on four pre‐defined criteria presumably indicative for absence of paralogy. This is done by calculating four specialized indices; loci are ranked according to their performance in these indices, and top‐scoring loci are considered putatively single‐ or low copy. The approach can be applied to any probe set. As it relies on long reads, the present contribution also provides template workflows for processing Nanopore‐based target capture data. Obtained markers are further tested and then entered into amplicon sequencing. For the detection of possibly remaining paralogy in these data, which might occur in groups with rampant paralogy, we also employ the long‐read assembly tool canu. In diploid representatives of the young Compositae genus Leucanthemum, characterized by high levels of polyploidy, our approach resulted in successful amplification of 13 loci. Modifications to remove traces of paralogy were made in seven of these. A species tree from the markers correctly reproduced main relationships in the genus, however, at low resolution. The presented workflow has the potential to valuably support phylogenetic research, for example in polyploid plant groups.

sive workflow for the identification of putative single-copy nuclear markers in a genus of interest, by mining a small dataset from target capture using a few representative taxa. The proposed pipeline assesses sequence variability contained in the data from targeted loci and assigns reads to their respective genes, via a combined BLAST/ clustering procedure. Cluster consensus sequences are then examined based on four pre-defined criteria presumably indicative for absence of paralogy. This is done by calculating four specialized indices; loci are ranked according to their performance in these indices, and top-scoring loci are considered putatively single-or low copy.
The approach can be applied to any probe set. As it relies on long reads, the present contribution also provides template workflows for processing Nanopore-based target capture data. Obtained markers are further tested and then entered into amplicon sequencing. For the detection of possibly remaining paralogy in these data, which might occur in groups with rampant paralogy, we also employ the long-read assembly tool Canu. In diploid representatives of the young Compositae genus Leucanthemum, characterized by high levels of polyploidy, our approach resulted in successful amplification of 13 loci. Modifications to remove traces of paralogy were made in seven of these. A species tree from the markers correctly reproduced main relationships in the genus, however, at low resolution. The presented workflow has the potential to valuably support phylogenetic research, for example in polyploid plant groups.

K E Y W O R D S
CompCOS loci, Leucanthemum, polyploidization, probe kit, reticulation

| INTRODUC TI ON
In recent years, Next-Generation Sequencing (NGS; Levy & Myers, 2016) has progressively enabled the analysis of complete plant or animal genomes for systematic research; to date, 2661 land plant and 10,670 metazoan genomic assemblies of varying quality are available in the Genome database at NCBI (NCBI Genome, 2023). However, conflicting signals among individual gene histories can distort the results of a joint analysis of a large number of genes. This is particularly prevalent in young lineages and those affected by polyploidization or reticulation, where hybridization or effects like incomplete lineage sorting (ILS) are prevalent (de Sousa et al., 2016;Knowles et al., 2018).
Coalescent-based approaches aiming at species tree reconstruction or constructing phylogenetic networks successfully deal with the problem but tend to be computationally intensive, which, in certain cases, may limit the number of loci that can be included in the analysis.
Genome reduction methods are suitable alternatives to avoid this issue, for example, high-throughput amplicon sequencing (Brouwer et al., 2018). Another powerful approach is target enrichment via in-solution hybridization, based on probes targeting a collection of loci. Specialized probe sets exist for various organism lineages and on various scales, from clade-specific ones (Mascher et al., 2013;Villaverde et al., 2018) to those universally applicable in large groups like, for example, all Angiosperms ("Angiosperms353", Johnson et al., 2019). However, target capture with commercial kits can be quite cost-intensive, especially as in lineages featuring larger genomes, the efficiency of capture probes can become unfeasibly low. Jones et al. (2019), for example, reported the proportion of ontarget reads to be only 1.92% for the sample with the highest genome size (16.25 1C pg, i.e., approximately 15.9 Gbp in the haploid phase) included in their study.
Sequencing such large collections of loci sometimes even may be unnecessary, for example when aiming to reconstruct phylogenetic histories of groups prone to hybrid speciation or polyploidization.
Despite ongoing optimization, for many tools intended for the analysis of allopolyploid origins, reticulation events in general, species delimitation, or species tree inference (e.g., allCoPol, Lautenschlager et al., 2020;BPP, Flouri et al., 2018), scalability towards high numbers of markers can still be an issue. In studies focusing on reticulation or polyploidization histories (Oxelman et al., 2017;Rothfels, 2021), the absence of paralogy in the nuclear genes under investigation is much more important if specialized inference tools are to be used.
Paralogs are pairs or groups of genes that started diverging by duplication instead of speciation (Fitch, 1970), for example, after a polyploidization event. Allopolyploidy on the other hand involves hybridization and results in pairs of genes called "homoeologous" per definition (Glover et al., 2016). Identifying parental homoeologs after an allopolyploidization event thus requires single-copy markers to preclude the confounding effects of paralogy.
While commercial probe kits for target enrichment intend to target low-or single-copy loci, uncertainty regarding the presence of paralogs increases with the taxonomic distance of a given study group relative to the lineage the probes were originally designed on, even when bait sets are regarded as "universal". As long as no published genome of the study group is available, establishing single-copy loci sets often is laborious and costly (but see Eserman et al., 2021).
In situations where (1) a smaller number of loci is sufficient for analysis, (2) single-copy gene information is preferred and (3) no reference genome is available while (4) financial resources may be limited, an elegant solution would be to combine reduced target enrichment efforts with a low-cost amplicon sequencing approach.
We here introduce a bioinformatic pipeline for mining single-or low-copy loci from a small-scale target capture experiment (using only few representative taxa), based on inherent characteristics of the target capture data pointing towards non-paralogy. The result is a phylogenetic marker set usable for elucidating complex phylogenetic relationships, for example, to disentangle hybrid speciation and polyploidization events; amplicons can be entered into NGS sequencing. Our pipeline is based on long-read sequencing via Oxford Nanopore Technologies (Oxford, UK; ONT), which can be independently performed even in small labs at comparatively low cost.
For groups where paralogy is particularly widespread, additional workflows are provided to enable identification and removal of traces of duplicated genes from the amplicon sequencing data. For this task, the Nanopore read assembly tool Canu (Koren et al., 2017) is employed as well as a tree-based orthology inference method, adapted from procedures introduced by Yang & Smith (2014).
We test our approach in the genus Leucanthemum Mill., a member of tribe Anthemideae in Asteraceae (Compositae). The family is characterized by rampant whole-genome duplication (WGD) and a pronounced history of polyploidization (Barker et al., 2016;Huang et al., 2016;Watson et al., 2020). In addition, for the hyperdiverse daisy tribes (also named the "Fab Five"), which Anthemideae is a part of, reticulation due to ancient hybridization has been inferred (Watson et al., 2020), which further complicates phylogenomic inferences. The genus Leucanthemum itself is a large polyploidy complex comprising 42 species with ploidy levels ranging from diploid (2x) to dodecaploid (12x); L. lacustre (Brot.) Samp. even has 2n = 22x = 198 chromosomes (Euro+Med, 2023). The average genome size of the diploid species is approx. 11.7 Gbp in the diploid phase (based on entries of L. vulgare, L. gaudinii, L. tridactylites, and L. laciniatum in the Plant DNA C-values Database, Leitch et al., 2019). A phylogenetic analysis of diploid Leucanthemum representatives, using amplified fragment-length polymorphism (AFLP) fingerprinting data and a multilocus species tree reconstruction (based on nine low-copy nuclear markers and the concatenated sequence information from five plastid intergenic spacer regions; Konowalik et al., 2015), corroborated earlier patterns

T A X O N O M Y C L A S S I F I C A T I O N
Genomics of relationships found by examining the External Transcribed Spacer (ETS) region of the nuclear ribosomal repeat (Oberprieler et al., 2014): the species could be classified into an ancient, paraphyletic group and a second, monophyletic group of species. A subsequent Bayesian species tree reconstruction with *Beast (Heled & Drummond, 2010) based on the same ten-locus marker set (Wagner et al., 2019) confirmed the already known bipartition of Leucanthemum diploids into a well-supported monophyletic group around L. eliasii, L. legraeanum, L. ligusticum, L. monspeliense, L. pluriflorum, and L. vulgare, and an unresolved, more ancient grade of the remaining diploids. This phylogenetic bipartition pattern was again found in a RADseq-based study (using Restriction site-associated DNA markers) by Ott et al. (2022), which mainly aimed at an integrative taxonomic treatment at the diploid level.
While for some of the polyploid species, their evolutionary relationships with diploid progenitors have been reconstructed (Greiner et al., 2012(Greiner et al., , 2013Oberprieler et al., 2011Oberprieler et al., , 2014, this has not yet been accomplished for the vast majority. Major obstacles have been the lack of a corroborated taxon delimitation based on established phylogenetic relationships at the diploid, "bottom" layer (but see Ott et al., 2022), but also the lack of a suitable molecular marker system for disentangling auto-and allopolyploid relationships in this young genus (crown age of diploids ca. 1-3 Ma; Wagner et al., 2019). This study was designed to lay the foundation to tackle these issues. Its aim is to (1) present a pipeline developed for mining single-or lowcopy nuclear markers from a small representative target capture dataset for subsequent long-read amplicon sequencing, (2) test the pipeline on a young, polyploid genus with a documented history of polyploidization and large genome sizes, (3) evaluate whether markers suitable for species tree reconstruction can be obtained using our approach, (4) test whether the selected markers show any signs of paralogy, and if so, (5) determine how these paralogous remnants can be removed.

| MATERIAL S AND ME THODS
For an easier overview of the numerous methods used in this study, Figure 1 graphically represents the workflow. Readers interested in applying the workflow to their own study group are advised to download two more detailed workflow figures alongside the detailed description of all wet-lab and data analysis methods from Dryad (see Data Availability Statement), where also more information is given for each paragraph and all auxiliary software tools are mentioned.

| Plant material and DNA extraction
Leaves of 43 individuals (accessions) from 20 Leucanthemum species were collected from 2004 to 2021 during field trips in France, Spain, Switzerland, Austria, Germany, Italy, Slovenia, Poland, Bosnia and Herzegovina, and Romania. The sampling thus includes all diploid species of the genus as accepted to date. Additionally, two outgroup samples from Rhodanthemum catanache and Chlamydophora tridentata were obtained from trips to Morocco and Cyprus, respectively. During the upstream target enrichment experiment, DNA from four Leucanthemum species was captured, each represented by a single accession; one of these accessions was also used for subsequent amplicon sequencing, the other three were replaced by other accessions. Voucher information and further details for all accessions are given in Table 1. Total genomic DNA was extracted using the CTAB protocol (Doyle & Dickson, 1987;, in most cases from silica geldried leaves (see Table 1).

| Target enrichment in four representative species
The target enrichment experiment was performed using the Arbor Biosciences myBaits capture kit "COS Compositae/Asteraceae 1Kv1"/"Compositae-1061", developed for the Asteraceae by Mandel et al. (2014). The "CompCOS loci" (COS: Conserved Ortholog Set) are based on alignments of Helianthus, Lactuca, and Carthamus Expressed Sequence Tags (ESTs) against Arabidopsis spliced gene models; the probe kit, consisting of 9678 baits, targets 1061 nuclear loci, which are conserved within Asteraceae, with several of them supposed to be single-or low-copy. The kit has been used successfully on various taxonomic scales, from species complexes across single genera to tribal relationships up to the Asteraceae tree of life (Jones et al., 2019;Mandel et al., 2017Mandel et al., , 2019Watson et al., 2020).
Captured DNA fragments were sequenced on an ONT MinION Mk1B device. As ONT does not provide an official protocol for multiplexed capture of more than one sample, we here present a new combined workflow comprising elements from the Nanopore "Sequence capture" protocol, the Nanopore "PCR barcoding genomic DNA" protocol, and the official myBaits manual.
Sheared and size-selected DNA (targeted fragment length: 1500-4000 bp) was barcoded for each sample, using the PCR Barcoding Expansion 1-12 (EXP-PBC001, see Table 1). The target enrichment was performed following the myBaits manual except for some modifications. As neither Arbor Biosciences nor ONT offered blockers for Nanopore libraries, we individually designed six blocking oligos, four of them also usable for post-capture multiplex amplification and consisting of the forward sequence of the used Nanopore barcodes alongside their 5′ flanking region ( Table 2, "BC0x_prime_block"), and two corresponding to the forward sequences of the two Y-arms of the barcode adapter BCA from the PCR Barcoding Expansion ("BCA_ block_1st/2nd"). Hybridization was performed at 65°C for 24 h (for a step-by-step account on the procedure refer to the "detailed methods" document at Dryad). Library prep was finished according to the Nanopore Sequence capture protocol using the Nanopore Ligation Sequencing kit (SQK-LSK109), and the library sequenced on a Nanopore MinION Mk1B using a FLO-MIN106 flow cell.

| Analysis of target capture data and choice of suitable loci for amplicon sequencing
In summary, assignment of reads to probe genes as well as assessment of intra-locus genetic variability and paralog separation is done by a sequential clustering approach that eventually leads to a certain number of representative sequences (i.e., cluster consensus sequences) for each locus and species. These sequences are then examined for properties regarded indicative of single-or low-copy loci (see Chapter 2.3.4). Many of the analyses and necessary data handling steps described have been carried out using BASH and/or AWK commands on a Linux workstation. These commands alongside explanations are provided at Dryad, together with parameters as applied for tools and software; the latter can also be found in the "detailed methods" document. Where specialized scripts were used, they are mentioned here; all of them can likewise be downloaded from Dryad as well as GitHub, together with an explanatory text file.

| Extraction of on-target reads
To determine on-target reads, the FASTA read files were locally BLASTed, using Blast+ v.2.12.0 (Altschul et al., 1990;Camacho et al., 2009), against the collection of source ESTs originally used to design the CompCOS capture probes (Mandel et al., 2014; hence referred to as "CompCOS ESTs"). From the collection of all reads for each individual, those with a BLAST hit were then extracted with the fasomeReCoRds application (Kent Utilities, n.d., download website see References).

| Clustering and assignment of reads to their respective CompCOS locus
The obtained on-target reads were subjected to a clustering process, separating reads divergent beyond a certain threshold, followed by the assignment of each cluster to its respective CompCOS locus. The clustering step was performed using VseaRCH v.2.15.0 (Rognes et al., 2016), including appropriate adjustment of the clustering threshold (CT) parameter. Among others, VseaRCH produced FASTA files with centroids and cluster consensus sequences of all clusters. The partially overlapping raw reads within each cluster were then aligned using the local version of lamassemBle (Frith et al., 2021; see also Katoh et al., 2019). After alignment, all cluster files were renamed according to the best CompCOS BLAST hit of their centroids (for details refer to the "detailed methods" document as well as file "4_workflow_commands.docx" at Dryad). As most ESTs received multiple clusters, the latter were numbered sequentially for each EST.

| Loci summary statistics and pre-choice loci
In order to enable a thorough examination of the loci, and also for calculation of the first of the four indices (see below), loci summary statistics on the clustering of each sample and across all individuals were calculated first, among others the numbers of clusters with less than five reads (here referred to as "low-coverage clusters").
With the above-mentioned statistics at hand, obviously unqualified loci were excluded in a first step to ease processing. For example, loci with a very high number of reads are unlikely to be single-copy, while a minimum of reads is needed for calculation of indices, and generally to reliably reconstruct marker sequences from Nanopore raw reads with relatively high error rates. Consequently, as a first approximation, loci with fewer than 100 and more than 1000 sequenced reads were excluded. Furthermore, only loci with reads present in three or more individuals and with at least one cluster (per individual) containing five or more reads were kept.
Clusters from these "pre-choice loci" were extracted for each individual (script 1_extRaCt_files.sH) from its respective folder containing all aligned, renamed cluster files (result of 2.3.2); singleton clusters (i.e., clusters with only one read) and low-coverage clusters were dismissed (see Section 4). For the remaining clusters, their consensus sequences were calculated from the alignments via lamassemBle, and the consensus sequences of all individuals combined locus-wise into common FASTA files. These FASTA files served as input for the scripts calculating indices 2-4 (see below).
2.3.4 | Choice of putatively non-paralogous loci, based on four specialized indices The crucial step in the analysis of the target enrichment data was the selection of the best loci given the Leucanthemum data and, more F I G U R E 1 Illustration of the workflow presented in this paper, for analysis of target capture data, extraction of putative single-/low-copy loci, analysis of amplicon sequencing data from 13 + 2 amplified loci, and phylogenetic analyses. Tools or custom scripts necessary for each step are given beside boxes, squares within boxes denote steps where BASH commands were used (available in file "4_workflow_commands. docx" at Dryad). Central steps of the pipeline are the clustering and the assignment of reads to their respective CompCOS loci, and the selection of suitable loci based on criteria 1-4 via the four indices depicted by blue diamonds (green arrows inside the diamonds signify whether high or low values are "good" in the respective index). Important steps during the second part (below the dashed line) are the mapping of NGS reads from amplified loci to references, which are based on cluster consensus sequences, and the de novo assembly of reads using Canu. Both steps include optimization procedures, which are described in detail in files 1-3 as available at Dryad. Steps intended to filter possible concatemers/chimeric sequences from the amplicon data are highlighted in orange. Finalized locus-wise alignments are used for calculating gene and species trees.
TA B L E 1 Taxa analyzed for the present study, alongside voucher information with localities, collectors and used Oxford Nanopore Technologies (ONT) barcodes.

TA B L E 1 (Continued)
generally, deciding on what are appropriate criteria for defining a "good" (i.e., single-or low-copy) locus. Given a collection of reads, it is not possible to infer single-or low-copy loci directly. For instance, the number of reads in each locus might not only be influenced by the amount of copies, but also by the amount of data sequenced, sequence properties, biased amplification, or degree of identity with the used probe. The number of reads per locus can thus only be used for very rough filtering of eligible loci (see above). Instead, indirect criteria have to be used. Here, we assumed a locus likely to be singleor low-copy if it met the four following criteria: 1. It has many reads per cluster; given a certain number of reads, few clusters in a locus can be (among other factors, e.g., sequence characteristics) a sign of low locus divergence; a low total number of clusters per se is not indicative, as few clusters could also be caused by a low total number of reads.
2. Ideally, all cluster consensuses from the same species are more similar to each other, than to clusters from another species (indicating the absence of paralogy that is sufficiently ancient for detection).
3. When calculating distance-based dendrograms from cluster consensus sequences, consensuses belonging to the same species cluster together, instead of being scattered all over the tree (similar to criterion (2) but representing a tree-based measure).
4. When calculating distance-based dendrograms from cluster consensus sequences, only few main clades/groups are found (indicating no to few paralogous groups).
These four criteria were assessed by calculating four specialized metrics (here called indices) for each of the 467 pre-choice loci. We provide three custom R scripts to accomplish this task.
The first index, corresponding to criterion (1), is the mean stan- Regarding criterion (2), we calculated a "k-mer-based similarity (KBS)" index, defined as the average pairwise k-mer distance between the several consensus sequences of ONE species, relative to the average pairwise k-mer distance between consensus sequences of ALL species. The index is calculated per species and provides the percent deviation given a locus-wide average distance between all consensuses. The underlying rationale is that in a single-copy locus, cluster consensuses from the same species will ideally be more similar to each other than to consensuses from another species. In such a case, the KBS index will be negative. Positive values on the other hand represent a situation where sequences of different species are usually more similar than those from the same species. To obtain a locus-wide quality measure, the mean KBS index of all four species can be used: in a "good" locus, where all species take on negative values, the overall mean KBS index will be low as well. Note: Barcode blockers (BC0x_prime_block) include the 5′ flanking region of the Nanopore barcodes, plus 8 random basepairs at the 5′ end (the 3′ flanking region of the barcodes remained unblocked). The barcode blockers were also used for post-capture amplification of the library; the blockers BCA_block_1st and BCA_block_2nd (blocking the sequences of the two Y-arms of the Nanopore Barcode Adapter BCA) were prevented from acting as primers by being phosphorylated at their 3′ ends.
TA B L E 2 Oligonucleotides used for blocking barcode and adapter sequences during the in-solution hybridization.
Computation of indices for criteria (3) and (4), as well as visual inspection of chosen candidate loci (see 2.3.5), required dendrograms to be computed for each locus. These were generated via single-linkage hierarchical clustering of the aforementioned k-mer distances between cluster consensus sequences in a given locus.
Single-linkage clustering is recommended here as it enables clustering of shorter consensuses (representing different fractions of an EST) together with longer ones that cover the whole EST and thus helps reduce the effect of artificial splitting. Computation of dendrograms was done using the aPe (Paradis & Schliep, 2019) package in R via Rstudio (R Core Team, 2021;RStudio Team, 2020) and is part of the script calculating indices 3 and 4 as well as of script 6_CalCulate_dendRogRams.R. The dendrograms of three exemplary loci are shown in Figure 2.
Criterion (3) is based on similar theoretical considerations as criterion (2), but similarity here is evaluated using dendrograms instead of mere k-mer distances. Here, the goal is to detect (and dismiss) loci where the sequences (leaves) of a given species are scattered across the locus' dendrogram instead of being grouped together. This is achieved by measuring the uncertainty to which clade an individual belongs by means of the Shannon entropy. However, it is required that the presumable number of main clades in a dendrogram is known. The script 4_index3entRoPy_index4silHouette.R provides this information (see below) and also performs entropy computation in the following way: Let I be the number of individuals and n ik denote the number of leaves (= consensus sequences of VseaRCH clusters) from individual i, which are assigned to clade k. Then, the entropy of the As a ranking criterion, the entropy was averaged across individuals using each individual's number of consensus sequences as weights: In this index 3 again, low values argue against paralogy.
As mentioned above, criterion (4) was likewise assessed by script 4_index3entRoPy_index4silHouette.R. This was done by calculating the silhouette coefficients (Kaufman & Rousseeuw, 1990) for various cuts through a given dendrogram, and considering the best value of K (corresponding to the highest coefficient) as the presumable number of main clades (e.g., K = 3 in Figure 2b). A high optimal K may indicate the presence of high sequence divergence and likely multiple Exemplary dendrograms as used during computation of indices 3 and 4 and for visual inspection, from three CompCOS loci targeted in the target capture experiment (a: At1g01050, contained in the pre-choice loci but not in the proposed loci; b: At2g28315, c: At3g05230, both contained in the amplified loci). Leaves are consensus sequences of VSEARCH clusters assigned to the respective locus, and dendrograms are based on single-linkage hierarchical clustering using pairwise k-mer distances. Leaf names contain the number of the respective cluster as well as the Nanopore barcode of the individual the reads belong to (BC02-BC06). In most cases, each individual is represented by several clusters (as highlighted in red for BC04, L. monspeliense). Corresponding index values of the respective markers are given below each dendrogram. index 1: mean standardized number of reads per cluster; index 2: k-mer-based similarity; index 3: mean entropy; index 4: silhouette coefficient (best K).
paralogs in a locus that, therefore, should be dismissed. As the script for ultimately selecting the candidate loci (see below) uses percentiles of values from continuous indices, the discrete silhouette index 4 could not be sensibly incorporated into the ranking for the "best" loci. Instead, we used it as an additional criterion to exclude possible paralogous loci from the collection of candidates.
In a final step, the top-scoring loci were extracted using the script 5_Best_loCi.R, based on locus data for index 1, 2 (mean and standard deviation), and 3. We defined a "good" locus as one which performed well in ALL indices. The script ranks loci according to their performance in each index, by calculating their respective percentiles, and outputs those loci, which are among the best 1%, 2%, …, 99% of all loci (i.e., up to the respective percentile threshold) with respect to all index values. Naturally, in this way, the number of proposed loci increases with higher percentages/percentiles thresholds; this means that from the output, a suitable "proposal threshold" can be chosen depending on how many loci are needed. Candidate loci with a best K value of 5 or more in the silhouette index calculation (index 4) were excluded from the collection (in a tree containing four species, a maximum of four main clades might theoretically be explained by strong phylogenetic signal influencing the best K).

| Assessment of candidate loci
To evaluate the ability of the indices to predict single−/low-copy loci in the dataset, and to further refine the choice of loci prior to wet-lab screening, dendrograms were computed for visual inspection for all obtained candidate loci using the script 6_CalCulate_dendRogRams.R.
Furthermore, in each locus's FASTA file (with cluster consensus sequences of all individuals), the sequences were aligned alongside the respective source ESTs using again lamassemBle. As substantial divergence occurred among consensus sequences of some markers, these alignments were unsuitable for further analysis and were only used for rough visual inspection and primer design, and, modified and processed further, for generation of references for mapping, see 2.4.4.
Upon inspection of dendrograms and alignments, loci with strongly divergent consensus sequences within or among individuals, or conspicuous inter-individual groupings leading to dendrograms comprising long basal branches or several clades with individuals' sequences scattered across the tree, were excluded from further processing.

| Primer design and PCR screening
All loci for which primers could be designed were tested in a PCR screening approach (for details on this procedure as well as others described in the entire Chapter 2.4. (Amplicon sequencing) refer to the "detailed methods" document at Dryad). Markers producing clear multiple bands in two or more individuals were regarded unsuitable and dismissed, as well as markers failing to amplify any product. All remaining markers can generally be regarded as acceptable, but for the purpose of the present study, markers for which primers failed to amplify clear bands in one or two individuals in the screening, or markers passing the PCR screening but having only low variability were excluded as well. Average values for indices 1, 2 (mean and standard deviation), and 3 were calculated for accepted versus non-accepted markers separately and the results examined for differences. Trends were tested for significance for each of the index values separately using one-sided Mann-Whitney U tests (coding see Table 3).

| Mapping and filtering
Demultiplexed reads from each individual were then mapped to all loci using minimaP2 v.2.21 (Li, 2018(Li, , 2021. Conveniently, mapping references are based on the already existing locus-wise cluster consensus sequence alignments comprising all individuals (see Chapter 2.3.5), after some processing also involving script 7_VseaRCH_testClus-teRingtHResHolds.Py.
In a next step, the mappings were evaluated in various ways to identify possible problems with the mapping procedure. Additional mapping references were generated where appropriate and the respective mappings repeated. Used tools for these steps comprised TA B L E 3 Proposed candidate loci alongside additional information.

Locus (no.)
Forward primer (5′-3′)    are given. Aligned length and estimated variability ("estim. variab.") referable to cluster consensus sequence alignments used for primer design, cut to marker length. "screen: ind. mult. bands": number of individuals with multiple bands in the PCR screening, alongside number of PCR bands in brackets. "status after PCR screening": the category of each marker as defined in the text; markers #50 and #62 (marked by asterisks) were amplified for comparison purposes. For the Mann-Whitney U test, accepted and amplified markers were coded as "yes", all other markers as "no". Marker #51_2 was not included into calculations. "amplif.: ind. mult. bands": number of individuals with multiple bands during amplification of all 42 accessions; bold type and asterisks denote those where multiple bands are likely due to paralogy, underlined numbers are loci eventually producing at least one secondary contig in any individual during de novo assembly. Columns "no. seq. approach A-C/concat. ML" denote the number of sequences included in the gene trees used for the species tree and concatenation (maximum likelihood) approaches (split loci given separately
Three approaches were tested regarding the tree input for species tree generation. First, gene trees computed from unmodified markers were used as input ("unmodified" approach A). In the other two approaches, efforts were made to diminish the effects of possible paralogy still present in the chosen markers, based on strategies outlined in Yang & Smith (2014). Briefly, in the second approach, markers featuring at least one accession with >1 contig (here referred to as "secondary contigs") were inspected for exceptionally long branches and divided into two subsets of accessions where necessary ("split paralogs" approach B). In the third approach, ALL markers from the second approach B were closely examined for possible evidence of remaining paralogy, by the help of gene trees, as well as NeighborNet splits graphs (Bryant & Moulton, 2004).
Markers with exceptionally long branches were again divided into subsets. Also, where necessary, presumably paralogous elements of the trees were removed using the Rooted ingroups (RT) strategy as described by Yang & Smith (2014). The result of this process are "cleaned" subtrees with only one contig per accession; the "cleaned" gene trees were input into a third round of species tree reconstruction ("Yang and Smith" approach C). For details on the RT strategy as well as approaches A-C refer to the "detailed methods" document at Dryad. For the species tree from the "Yang and Smith" approach C, conflicts among the underlying gene trees were examined using PHyPaRts v.0.0.1 (Smith et al., 2015) with option -a 1 (thorough conflict analysis) set. Visualization of the results on the species tree was done using PHyPaRtsPieCHaRts (Johnson, n.d., download website see References).
In addition to the quartet-based summary method as implemented in astRal, a concatenation-based (maximum likelihood, ML) approach was tested as well, again employing IQ-TREE. The "cleaned" datasets as created for the "Yang and Smith" approach C were used in order to avoid improper correlation of secondary contigs with identical names across different datasets. Marker alignments were concatenated into one single FASTA file and input into IQ-TREE using the same settings as for gene tree calculations. In the resulting tree, nodes with bootstrap support <50% were collapsed using TreeGraph v.2.15.0 (Stöver & Müller, 2010).  Table 4. From the filtered reads, app.

| Target enrichment data analysis
34% were found to be on-target according to the local megablast search, that is, pertaining to one of the CompCOS ESTs. Testing for the most sensible threshold for clustering reads (Table 5) provided evidence for a CT around 0.90 constituting a turning point, from whereon cluster numbers steeply increase while cluster maximum sizes drop. On the other hand, consistently more ESTs were found using higher CTs; however, obtaining a few percent more ESTs is unlikely to outweigh the considerable difficulties analyzing a triple amount of clusters. Considering that the CT should also reflect the raw error rate of Nanopore reads (app. 12% with the used chemistry), above which clusters might be separated based on mere artifacts, the CT was chosen at 0.88. Results from the clustering procedure for all four individuals are given in Table 6. All cluster centroids could be assigned to a CompCOS EST.

| Loci summary statistics and pre-choice loci
According to loci summary statistics (condensed information in Among pre-choice loci, averaged across all loci and individuals, 77% were low-coverage clusters (1-4 reads; variation 42%-93% for a single locus); only 16% of the clusters had 10 reads or more (4%-50%).

| Choice of putatively non-paralogous loci and assessment of candidate loci
Minimum, maximum, and average values for indices 1-4 in the prechoice loci are reported in Table 7. At a proposal threshold of 15% (loci among best 15% regarding values in all indices), six loci were proposed by script 5_Best_loCi.R; at lower values, no locus passed the filter (Figure 3, "proposed loci"; detailed account on loci proposed at different thresholds is available at Dryad). Only four more loci were added until 27% proposal threshold, while at higher thresholds the number of proposed loci steadily increased. To obtain a reasonable number of loci for testing, we chose the proposal threshold at 50%, where 53 loci were proposed ("candidate loci"). These candidate TA B L E 4 Read statistics as obtained after demultiplexing of data from the target enrichment experiment with four individuals.

| PCR screening, accepted and amplified loci, amplification results
For three loci, primers could not be designed, the remainder was tested in the PCR screening ("tested loci"). Of these, 23 loci were regarded as acceptable in principle ("accepted loci", see Table 3).
For the purpose of the present study, five more loci were dismissed for failing to amplify in one or two individuals, and another five for having too low variability. Thus, a total of 13 loci was available for amplicon sequencing ("amplified loci"). A graphical representation of the amount of proposed, accepted, and amplified loci as present at different proposal thresholds is shown in Figure 3. It is apparent that the steep increase in the amount of proposed loci from the ca.
40% percentile threshold on is not followed by the accepted loci. On average, compared to non-accepted loci, accepted loci were characterized by lower (=better) values for index 2 and 3 (

| Amplicon sequencing data analysis
The two runs of library

| De novo assembly using Canu
In four out of 569 assemblies, no contig was assembled (all Canu assembly details available at Dryad); the assemblies were repeated with modified settings (increased readSamplingCoverage parameter or estimated genome size, and/or setting corMinCoverage to zero), whereby contigs were obtained. From five assemblies with <120 input reads executed manually with adapted settings, only one failed (Rhodanthemum for marker #56). Twenty-four problematic assemblies were recognized and checked visually. Altogether, 66 secondary contigs were assembled, mostly in markers #45, #54, and #62, three markers with suspected hidden paralogy (see 3.4).
Due to unknown reasons, not a single secondary contig was assembled in marker #9A, despite 17 individuals with two or even three bands in the PCR indicating possible shorter as well as longer copies.
Generally, most problems were found in marker #62 (and to a lesser extent also marker #56). All assemblies of marker #62 were thus repeated with an increased estimated genome size to account for expected paralogs, plus lowered minReadLength/minOverlapLength to include shorter marker variants. Altogether, only one out of 574 assemblies failed after all optimizations.

| Gene tree/species tree reconstruction
Of the 13 amplified markers plus markers #50 and #62, six had at least one secondary contig assembled in an individual (i.e., markers #45, #51_1, #54, #57, #58, and #62; see Canu statistics table at Dryad). Species tree reconstructions were done using the 13 markers yielded by the pipeline, plus marker #50 to add more information to the dataset; the paralog marker #62 was not included. All marker alignments as used for approaches A-C, for NeighborNet calculations and as basis for the concatenated matrix, also marker #62, are also available at Dryad, as are the gene trees from the unmodified markers. The astRal species tree based on these markers ("unmodified" approach A) is shown in Figure 4b. Regarding the "split paralogs" approach B, gene trees from markers #45 and #54 (and expectedly, although not included, #62) showed a very distinct TA B L E 7 Statistics of enriched, pre-choice and proposed candidate loci in all four individuals used for target enrichment. Avg.
--3.29 2.62 Note: Minimum and maximum values across all loci are given alongside average values. "Clusters with 1-4 reads" are low-coverage clusters as defined in the text. For index 2 (k-mer-based similarity, KBS), statistics are given separately for its mean value and standard deviation. "--" indicates that the respective value was not obtained.
pattern of two subgroups subtended/separated by one or two long branches, representing a clear sign of paralogy. All three markers were split into two. In #51_1, three individuals were subtended by a single long branch and were removed. The species tree based on the resulting 16 markers including split markers #45_p1 and #45_p2, and #54_p1 and #54_p2, and the corresponding modified gene trees alongside NeighborNet splits graphs are available at Dryad. In the next step, these gene trees and splits graphs were closely examined for signs of remaining paralogy and leaves pruned where necessary (see Table 3), following the approach as described in the "detailed methods" document at Dryad (modified after Yang & Smith, 2014).
All markers still having secondary contigs (#45_p1, #51_1, #57, and #58) were modified, plus markers #9A and #50. Marker #13A was dismissed completely due to a combination of two long branches and too few remaining leaves in subtrees. Thus, 15 loci were used for creating the species tree according to approach C ("Yang and Smith"), which is presented in Figure 4a. of standardized pipelines to analyze long-read target capture data (Andermann et al., 2020). The central step in processing the data is the assembly of reads as well as the assignment to their respective gene as represented in the probe set. For Illumina data, this can be automated using, for example, the HyBPiPeR pipeline (Johnson et al., 2016), which maps or BLASTs reads to target loci first, then performs de novo assembly of mapped reads and finally extracts exons and/or introns from the assembled contigs. HyBPHylomaKeR (Fér & Schmickl, 2018) by contrast maps reads to a pseudo-reference previously generated from concatenated probe sequences, and consensus calling based on mapped reads is performed; the obtained consensus is then fragmented into its exonic parts.
The two main strategies for read assignment employed by these two pipelines (1: mapping, 2: BLASTing reads) have also been applied independently, mainly using Illumina but also Nanopore data.
(1) Nanopore target reads were mapped against a reference, for example, by Eckert et al. (2016), but this strategy is not possible for the present study due to the lack of a genomic reference for Leucanthemum. Alternatively, for a set of probe loci, a pseudoreference as in HyBPHylomaKeR might be used; however, the larger and less standardized read length of Nanopore reads renders this a quite unfunctional task.
(2) BLASTing reads against probe sequences is frequently employed for the identification of ontarget Illumina and Nanopore reads (Giolai et al., 2017;Mandel et al., 2014) and does not necessitate any other references for read assignment; thus, we chose to follow this strategy. To extract reads of interest while enabling assessment of intra-locus variability and separation of putative paralogous sequences as far as possible, we employed a sequential clustering strategy, first extracting reads belonging to the CompCOS loci via BLAST (adapting settings by Giolai et al., 2017) and then clustering these on-target reads with VSEARCH; assignment of each cluster to a CompCOS locus was subsequently done via another BLAST search. In contrast to de novo assembly as performed by HyBPiPeR, we then built lamassem-Ble consensuses from the aligned reads of each cluster to obtain representative sequences (see below), similar to the approach in HyBPHylomaKeR.

F I G U R E 4
Species trees for the genus Leucanthemum inferred with astRal, based on IQ-TREE gene trees from (a): 15 locus alignments from the "Yang and Smith" approach C, and (b): 14 locus alignments from the "unmodified" approach A. Chlamydophora tridentata was set as the outgroup. Local posterior probabilities above .5 are given above branches. Values below branches are results from the PHyPaRts analysis and denote the numbers of gene trees, which, at the respective node, are concordant with the species tree or with a conflicting bipartition, respectively. In some cases, the same gene tree can support both the species tree and conflicting topologies. Branch lengths are in coalescent units, branches of Chlamydophora and Rhodanthemum are shortened and not to scale. The clade representing the L. vulgare-group is indicated by arrows.

| Clustering and cluster consensuses
Using clustering as means of capturing locus diversity is not completely free of issues. Ideally, each single-copy locus is expected to yield one or two clusters in a diploid. However, many more clusters were usually found for loci as well as ESTs, for example, an average of eight (in L. rotundifolium) to 30 (in L. monspeliense) clusters per locus per individual; the amount of clusters seemed to be positively correlated to total read numbers. The reasons for this are apparent upon inspecting read/cluster consensus alignments of a locus. Two of those are available, for a "very good" and a "very bad" locus (At3g05230 and At1g01050 as presented in Figure 2), at Dryad due to their large size. Cluster consensuses are aligned alongside reads of singleton and low-coverage clusters, which were excluded from most computations. The gappy, "perforated" alignment even in exonic regions (marked by the three EST sequences) of the "good" marker At3g05230 shows that much of the variability found among reads and also cluster consensuses is caused by Nanopore sequencing errors, which thus have the greatest influence on the cluster generation process. Due to the chosen high threshold (which is necessary to avoid lumping closely related paralogs into the same cluster), singleton clusters are expected to be numerous, which was also observed here. The occasionally bad quality of read beginnings and ends might raise their number even further, but this problem could be alleviated by more rigorous character-based quality trimming. Systematic Nanopore errors (e.g., from homopolymers) are expected to result in low-coverage clusters, which are, however, easily filtered (see below). Artificial cluster splitting due to non-overlapping cluster subregions/reads was not observed and seems to be a minor problem, but cannot be ruled out entirely; splitting might happen if a locus lacks long enough reads that act as conjunctive elements in the clustering of shorter reads. Altogether, the high degree of baseline variability (i.e., background noise) found even in the "good" locus is mainly caused by errors and artifacts and does not contain any biological meaning. Newer ONT chemistries that have become available recently and provide greater raw read accuracy (Q20+ and higher) will help to largely remove this background noise.
In the "bad", that is, presumably paralogous locus At1g01050, the same effects are present; however, a much greater variability even among consensuses/reads of the same individual is observed.
The traces of presumable paralogs can be seen in (consensus) sequences which are barely alignable outside the exonic regions (due to sequence divergence) and/or the respective gene (due to paralogs being located next to different neighboring genes). In At1g01050, three different sequence variants, each represented by at least two consensuses, can be identified outside and within exonic regions and in consensuses as well as raw reads. These variants are traceable even through the disturbing influence cast by the sequencing errors and correspond to the three main clades of locus At1g01050 in Figure 2a. It seems very clear that this strong signal is responsible for the very bad evaluation of the locus by the pipeline.
To simplify data handling and to attenuate the negative effects of Nanopore raw read errors, for calculation of indices 2-4, singleton and low-coverage clusters (with less than five reads) were removed as likely representing artifacts. As the quality of a consensus decreases with the number of reads it represents, setting the minimum reads per cluster to a value >5 would be desirable; however, the trade-off between removing errors but no true signal from the dataset always has to be considered. Non-filtered clusters were reduced to their consensus sequences, thus eliminating a considerable portion of random sequencing errors. The concept thereby is to consolidate the total variability found in reads of a locus (regardless of its causes) into few representative sequences, which can then be searched for signs of paralogy via indices 2-4.

| Choice of single−/low-copy loci via specialized indices
The pre-selection procedure of loci prior to index calculation was carried out considering mainly pragmatic reasons. However, dismissal of loci with too few (<100) reads resulted in exclusion of 30% of enriched loci from any further consideration. As single-copy loci are expected to be represented by a low number of reads a priori, it is possible that suitable loci were discarded by setting this threshold.
This problem could be alleviated by increasing the total number of reads sequenced, for example by aiming at 400,000 passed reads per individual.
Altogether, the pipeline performed well in predicting low-copy loci in Leucanthemum. Almost all loci proposed at low proposal thresholds (up to the max. 27% percentile) turned out to be acceptable for use ( Figure 3). Limiting manual examination of loci to those included in the 30% proposal threshold instead of the chosen 50% would have yielded a proportion of about two third acceptable loci.
A short evaluation of the performance of each index is provided below; generally, for all indices, better values were observed in the "candidate loci" relative to the "pre-choice loci" (see Table 7 and file "6_full_locus_statistics.xlsx" at Dryad).

| Index 1 (mean standardized number of reads per cluster)
Results from this index were somewhat equivocal: there was almost no difference in values between accepted versus non-accepted markers (Table 8). A possible problem is that the index generally penalizes loci with a small total number of reads. This will disadvantage single-copy loci characterized by few reads per se. Indeed, very bad values (<−1) for index 1 are exclusively assigned to enriched loci with 1-104 reads in the present dataset. Loci with few reads can also score very well; on the other hand, however, loci with a larger number of reads will never score too bad. Also, the explanatory power of the index can be affected in cases where the number of clusters is increased artificially, for example due to Nanopore errors or a random lack of long reads in a locus/EST. These facts might explain the somewhat limited utility of the index.

| Index 2 and 3 (k-mer-based similarity, KBS, and mean entropy)
Both indices perform well in predicting low-copy loci from the Leucanthemum dataset, especially the mean entropy, where the difference among values in accepted versus non-accepted loci (within the 53 loci proposed by script 5_Best_loCi.R) is highly significant (Table 8). Both indices, however, suffer from the fact that individuals with only one consensus always will be scored best (−100 and 0, respectively, see Figure 2c). be accounted for. A disadvantage is that two short sequences, which belong to the same paralog but cover different parts of its sequence, will artificially deteriorate the KBS index. Index 3 is less affected by this problem due to being based on dendrograms calculated using single-linkage clustering. A possible solution to this problem in the KBS index would be to revert it to the concept it was originally inspired by, that is, the LB score (long-branch score) presented by Struck (2014). This score is similar to the KBS index, except for the fact that it uses patristic distances for calculation and thus, similar to index 3, is based on an underlying tree. Originally designed to detect long-branch attraction within a phylogenetic tree, the LB score is defined as the mean pairwise patristic distance of a taxon to all other taxa in the tree, relative to the average pairwise patristic distance across all taxa. It is, however, used on trees with one single accession per taxon and calculated for each leaf of the tree, so is not easily applicable here. Anyway, results for index 2 in its present shape already seem to match nicely the conditions as seen in the respective dendrograms of the loci (compare Figure 2): in marker #52, the sequences of all individuals cluster quite well but are arranged in grades in two of four individuals. The corresponding KBS is slightly increased compared to the "perfect" marker #41, in both mean and standard deviation.

| Index 4 (silhouette coefficient)
An inherent problem with this index is that the silhouette coefficient is defined for ≥2 clusters only. Consequently, a perfectly homologous locus can never receive a best K value = 1 (see K = 2 in This would argue for extending the proposal threshold beyond 30% when investigating proposed loci. Marker At1g01050, which scored rather badly in general and very badly for index 3, is proposed only at the 94% threshold and is certainly not to be considered in studies relying on single-/low-copy loci.

| Amplicon sequencing
Several tools for automatic handling of Nanopore amplicon data are available. Examples are nanoRtax (however limited to 16S rRNA amplicons only and intended for taxonomic analysis of microbial community samples, Rodriguez-Perez et al., 2021) or deCona (for processing data from demultiplexing to consensus calculation, Doorenspleet et al., 2021). However, none of these tools can adequately handle paralogs present in amplified loci. Thus, amplicon data here were processed manually, by first mapping the reads to references of the amplified loci and then de novo assembling reads of each locus and individual separately.
To obtain suitable mapping references from the target capture data and thus ensure effective mapping of divergent taxa, a major part of a group's variability should be (preferably evenly) represented in the individuals used for target enrichment. An alternative to de novo assembly of mapped reads is variant calling from the mapping and subsequent creation of a modified reference based on the called variants. While this can represent a suitable alternative to de novo assembly in several cases (Scheunert et al., 2020), it was found to be inappropriate in the present study (results not shown). target capture data with HyBPiPeR, also noted the problem of inadvertently merging paralogs with high sequence similarity into one. In plant groups prone to hybridization and with recent allopolyploidy or even homoploid hybrid speciation, another general drawback is that, apart from paralogous copies, homoeologs might also be assembled, which confounds the correct detection of remaining paralogy.
Although only diploid taxa were included in the present study, an influence exerted by potential hybrid speciation and/or introgression in single taxa cannot be ruled out.
4.5.3 | Tree-pruning as means of eliminating paralogy from datasets, and the issue of missing data Apart from dismissing whole loci due to presumed paralogy, separating or pruning supposedly paralogous elements in gene trees is an alternative method of ensuring orthology in a dataset, and was adopted, for example, by Liu et al. (2019) and Karimi et al. (2020). In the present study, tree-based pruning was used to remove remaining traces of paralogy, as per our definition detectable via secondary contigs but also non-monophyletic outgroups. We hereby followed Yang & Smith (2014) and their RT method (see Chapter 2.5).
Approaches by the latter authors were also employed by Morales-Briones et al. (2022), however, using Illumina data.
Unfortunately, employing the RT method in the way described here resulted in a substantial increase of missing data in the markers that had to be modified, with gene trees losing an average of 27.3% of their leaves. It has been argued that species tree inference methods like ASTRAL are robust to missing data (e.g., Nute et al., 2018;Xi et al., 2016); indeed, the species tree based on gene trees from the "Yang and Smith" approach yielded a similar topology as that based on the "unmodified" approach. (2) Employing sufficient sequencing depth per individual will alleviate problems related to low-and lower-coverage clusters and to lack of data in general, and will also provide enough reads in single-copy loci to prevent their inadvertent exclusion.
(3) If longer loci than the ones produced here are desired, the DNA shearing, size-selection procedures, or Nanopore library prep in the target capture experiment can be modified accordingly, to eventually produce longer sequenced reads.
This would also lead to both better and longer consensuses, improving calculation of indices 2 and 3 and reducing the risk of artificial cluster splitting even further. (4) The amount of loci obtained by the pipeline is to some extent adjustable by the user. In Leucanthemum, the proportion of acceptable loci decreases as more and more loci are proposed, but even at higher proposal thresholds, good loci can be obtained, at the price of more laborious manual examination and PCR-testing. How to deal with this trade-off depends on the study envisioned, and might, of course, vary in different plant groups. In principle, the approach for identifying single-copy loci from target capture data as presented here would be also feasible for use with long reads from Pacific Biosciences (PacBio) sequencing. Some results might even be improved, for example, regarding the clustering process or the performance of index 1; this would, however, come with the disadvantage of higher sequencing costs. The analysis of such data additionally would require tools tailored to PacBio or Illumina data in some cases.

| Phylogenetic inferences
One aim of the present study has been the establishment of a suitable molecular marker set, which could be used for the reconstruction of phylogenetic relationships among the diploid representatives of Leucanthemum, and possibly also for disentangling the phylogenetic history of its polyploid lineages in future studies.
Species tree reconstruction based on multilocus sequence information is nowadays accomplished either by concatenation of individual marker alignments and a "total evidence" analysis with gene tree reconstruction methods (e.g., maximum likelihood, Bayesian inference), or with more sophisticated species tree reconstruction methods, for example, based on coalescent theory. It has been asserted that the concatenation approach may produce robust and well-supported, but inaccurate phylogenetic reconstructions (Kubatko & Degnan, 2007;Weisrock et al., 2012). This observation is corroborated by the present study: while an overwhelming number of monophyletic groups receive high support in terms of bootstrap supports in the concatenation-based maximum likelihood tree (available at Dryad), its topology (especially regarding the position of members of the ancient grade of taxa here placed within a derived, monophyletic clade, while by contrast the L. vulgare-group lacks monophyly) is not only in sharp contrast to the astRal species trees (Figure 4), but also to all other hitherto published phylogenetic reconstructions based on alternative marker sets, that is, nrDNA ETS (Oberprieler et al., 2014), plastid and low-copy nuclear markers (Konowalik et al., 2015;Wagner et al., 2019), and 9248 RADseq loci (Ott et al., 2022). Therefore, the species tree reconstruction based on astRal appears more trustworthy, despite the disappointingly low support values for many of its monophyletic groups. It has to be concluded that approaches based on only few markers, as the one presented here, are unsuitable for effectively reconstructing evolutionary relationships among diploid Leucanthemum species; this is better accomplished using RADseq data (as shown in Ott et al., 2022).
Both astRal species trees in Figure 4 corroborate the wellknown bipartition of Leucanthemum diploids into two groups: the ancient, paraphyletic assemblage of L. lithopolitanicum through to L. laciniatum and L. tridactylites, and the closely knit, monophyletic group around L. vulgare. The results from the PHyPaRts analysis, revealing large amounts of discordance among gene trees even in nodes supporting well-established relationships, suggest that low supports in the species tree are indeed caused by gene tree conflict (although, especially within the L. vulgare-group, a lack of phylogenetic signal also seems to play a role). This conflict may be attributable to ILS and/or gene flow across lineages (hybridization among species or homoploid hybrid speciation). Konowalik et al. (2015) proposed the latter as the main accountable factor; however, the assumed single-copy nature of nine of the loci used in that study (previously proposed by Chapman et al., 2007) has never been verified for Leucanthemum; the gene tree incongruence observed may thus also have been due to artifacts caused by unrecognized paralogy. The overall concordance observed among the plastid tree and the nuclear species tree in Figure 4a, however, could argue for ILS rather than hybridization as responsible factor for the nuclear gene tree incongruence, given also the short internal branches relative to long terminal ones in the paraphyletic and ancient group and overall short branches in the monophyletic, more recently diverged L. vulgare-group in Figure 4a. Retention of ancestral polymorphism could have been caused by two epochs of fast speciation events (radiations) in the evolutionary history of the genus, one in an early phase with the onset of the glaciation cycles in the Early Pleistocene (Gelasian or Calabrian, 2.58-0.8 million years ago) and the other in a more recent one around 500,000 years ago (Chibanian; see chronogram in Wagner et al., 2019).

| CON CLUS IONS
In the present study, we introduced a new method for mining target enrichment data from a genus of interest for single-copy nuclear loci. The presented pipeline renders this a manageable task at comparatively low cost. Our approach is based on Nanopore read data, leveraging the power of long reads, which are essential for exact identification of paralogous copies, together with straightforward sequencing even in very small labs. We have shown that sequential clustering (BLAST plus VSEARCH) provides an elegant way of assigning captured reads to their probe sequences and to assess locus variability. The four presented indices for the extraction of suitable loci proved effective, with some limitations only for index 1 (number of reads per cluster). For detection of remaining paralogy in the chosen, amplified markers, Canu de novo assemblies of mapped amplicon reads were also employed. Our results show that although Canu does not perform 100% correctly in generating representative contigs for all presumed paralogous copies contained within a given marker's reads, it is a good indicator of the presence of paralogy, and its performance is likely to still increase in older lineages. Subsequent tree-based pruning of suspected paralogous elements can result in substantial amounts of missing data in groups with generally high levels of WGD like the Asteraceae. In families not as intensely influenced by polyploidization, our pipeline can be expected to yield a larger number of suitable loci requiring less post-sequencing modification with regard to remaining paralogy. In difficult genera like Leucanthemum, conflicting phylogenetic signal among the obtained markers might lead to problems in downstream analyses aiming to reconstruct diploid relationships (as seen here), and thus likely also in analyses examining allopolyploid origins. In the latter case, using only an optimal subset of the found markers (those which did not require modification after amplification and sequencing-six loci in this study) might help to enable successful analyses; as demonstrated in different studies (Freyman et al., 2023;Jones et al., 2013), very few markers can be sufficient for this purpose.

ACK N OWLED G M ENTS
This study was supported by grants from the German Research Härtl for technical assistance and invaluable support in the amplicon sequencing project. Thanks are also due to Anja Heuschneider for friendly advice regarding the lab work. We are grateful to Salvatore Tomasello for helpful discussions in the development of the indices and beyond. We would also like to thank Marco Dorfner for help in performing the target capture and for numerous enlightening discussions. Two anonymous reviewers are acknowledged for their valuable comments, which considerably helped to improve the manuscript. Open Access funding enabled and organized by Projekt DEAL.

CO N FLI C T O F I NTER E S T S TATEM ENT
The authors declare no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
All base-called, passed, demultiplexed raw reads generated in the target enrichment experiment and from amplicon sequencing are